1 INTRODUCTION


This  report  describes  the  research  findings  of  a  three  year  study 
that  included  the  measurement  and  statistical  modeling  of  computer 
reliability  as  affected  by  system  activity.  The  ultimate  aim  of  this 
research  has  been  to  develop  fundamental  concepts  that  can  be  used  to 
increase  the  reliability,  availability,  and  throughput  of  computer 
systems.  During  this  three  year  period,  the  major  research  findings 
have  been  the  following: 

*  Established  existence  of  a  failure/load  relationship,  that  is, 
the more a computer is used, the more it fails [Sec. 2].

*  Demonstrated  the  effect  of  more  complex  program  interactions 
(which  result  from  an  increase  in  data  processing)  as  a  major 
cause of computer failure [Sec. 3].

*  Identified  major  problem  areas  in  software  error  types  (deadlock, 
data  management,  and  error  handling)  where  improvements  in  system 
recovery or fault tolerance should be made [Sec. 3].

*  Demonstrated  the  effect  of  electro-mechanical  device  failures 
as  a  subtle,  but  significant  influence  on  software  failures 
[Sec. 3].

* Identified device level failure mechanisms as explicit causes
of failures that are due to heavy usage [Sec. 4].

* Demonstrated that not only the amount of activity, but also the
type of activity affects the reliability of computer systems
[Sec. 5].

*  Established  existence  of  change  in  error  distributions  and/or 
activity  rates  prior  to  a  crash.  This  change  could  provide  a 
basis for development of a technique for prediction of computer
failures [Sec. 6].


The  research  findings  are  discussed  in  more  detail  in  Sections  2-6. 
The  appendix  contains  a  list  of  publications  during  the  three  year 
period,  as  well  as  a  list  of  participating  scientific  personnel. 


2  ESTABLISH  FAILURE/LOAD  RELATIONSHIP 


The failure/load relationship was established through an
analysis of data collected over a period of eight years from three
generations  of  computers.  These  computers  were  installed  at  two 
different  physical  locations  and  were  dedicated  to  different  types  of 
applications.  To  establish  the  dependence  of  system  failures  on  the 
amount  of  activity  or  load  on  the  system,  comparisons  were  made  between 
the  different  generations  of  computers,  as  well  as  between  identical 
computers  running  different  applications. 


2.1  PRIME  TIME  VS.  SYSTEM  CRASHES 

Initial  interest  at  Stanford  University  in  the  relationship  of 
utilization  level  and  system  reliability  began  with  the  study  of 
[Beaudry 78], which developed a model for failure of computing systems
with varying workload. The model was based on statistical analysis of
data from a Triplex multiprocessor at Stanford Linear Accelerator Center
(SLAC). This system used three separate central processing units
(CPUs): two IBM System/370 Model 168s and an IBM System/360 Model 91.
First  it  was  found  that  a  disproportionate  number  of  service 
interruptions  (system  failures)  occurred  during  prime  time  when  system 
utilization is higher. The results are shown in Table 2.1. Then system
failures  and  job  arrivals  were  compared,  and  the  statistical  evidence 
also  pointed  to  a  definite  relationship  between  heavy  demand  on  system 
resources  and  the  system  failure  rate.  As  a  result  of  this  study,  it 
became  clear  that  a  constant  failure  rate  was  no  longer  valid  in  a 
fluctuating  load  environment.  (This  was  confirmed  later  by  [Castillo 
80,81] at Carnegie-Mellon University.)

Table 2.1. Higher crash frequency during prime time.

Source of                    Occurring During
System Failure     Number    Prime Time

All                 239      62%
Software            155      68%
Hardware             58      60%
Miscellaneous        23      [illegible]%
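
One way to read Table 2.1 is as a test of the constant-failure-rate assumption: if the failure rate were constant, the share of failures falling in prime time should simply equal the share of operating hours that are prime time. The sketch below checks this with a normal approximation to the binomial. The report does not state what fraction of hours counts as prime time; the value 0.40 passed as `prime_share` is an assumption made only for illustration, and the function name is hypothetical.

```python
import math

def prime_time_z(total_failures, prime_failures, prime_share):
    """z-score of the observed prime-time failure count against a
    constant-failure-rate hypothesis (normal approximation to binomial)."""
    expected = total_failures * prime_share
    std_dev = math.sqrt(total_failures * prime_share * (1.0 - prime_share))
    return (prime_failures - expected) / std_dev

# "All" row of Table 2.1: 239 failures, 62% of them during prime time.
total = 239
observed = round(0.62 * total)
# ASSUMPTION: prime time covers 40% of operating hours (not from the report).
z = prime_time_z(total, observed, prime_share=0.40)
print(f"observed prime-time failures = {observed}, z = {z:.2f}")
```

A z-score of several standard deviations would indicate that the excess of prime-time failures is unlikely under a constant failure rate, but the conclusion depends on the assumed prime-time share.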

2.2 ANALYSIS OF ADDITIONAL FACTORS IN RELIABILITY


A subsequent study was performed on UNILOG data from the same
Triplex  multiprocessor  at  SLAC,  but  additional  factors  were  analyzed  in 
order  to  more  accurately  evaluate  system  reliability  [Butner  80]  [Iyer 
82a]. Not only did this analysis seek to find more indicators of system
load than just job arrivals, but it also analyzed more types of failures
than  just  those  failures  which  caused  service  interruptions  or  brought 
the  system  down. 

Overall computer load is a multi-dimensional quantity with many
parameters  that  indicate  utilization.  Some  of  these  affect  the  failure 
rate more than others. System utilization and performance data was
analyzed,  and  it  was  found  that  paging  was  the  strongest  single 
utilization  factor  related  to  failures.  An  increased  level  of 
concurrency  implies  an  increased  usage  of  hardware  and  software  paths, 
so  it  was  not  unexpected  to  find  a  strong  relationship  between  paging 
and  failures  for  hardware  and  a  similar,  but  less  strong,  relationship 
for  software.  A  profile  of  paging  at  SLAC  is  shown  in  Figure  2.1  and 
component failure profiles for hardware and software in Figure 2.2.

2.3 FAILURE/LOAD RELATIONSHIP ON DIFFERENT COMPUTERS


Since  these  preliminary  studies  showed  a  correlation  between 
failures and system utilization, it was necessary to demonstrate whether
this  correlation  would  exist  for  other  computers  and  other  computing 
environments.  Consequently,  a  comprehensive  statistical  analysis  was 
made on data collected on an IBM 3033 which was installed on the
Stanford University campus at the Information Technology Services (ITS,
formerly  known  as  CIT)  [Iyer  82b].  Differences  in  computer  redundancy 
and  applications  which  exist  at  this  facility  made  it  a  good  choice  for 
comparing  data  from  ITS  with  that  of  the  SLAC  facility.  Fortunately, 
the  reports  generated  by  the  SLAC  and  ITS  systems  contain  similar 
information. System utilization data came from the IBM System
Management Facility (SMF) records, and the failure data came from the
operator-maintained database called UNILOG. This data was examined for
the  same  three  years  at  both  locations. 

The preliminary study showed that all system failures correlated
with load; therefore, it was important to determine whether this was
true for different components of a system, as well as the system as a
whole.  The  frequency  of  failures  for  different  component  and  usage 
groups  for  both  systems  revealed  a  strong  statistical  dependency  of 
component  failure  rates  on  several  common  measures  of  utilization  (CPU 
utilization, I/O initiation, paging, and job-step initiation rates).
This  relationship  existed  for  electrical  and  mechanical,  as  well  as 
software components [Iyer 82b].

Comparisons between the two systems revealed interesting
differences:  for  example,  although  SLAC  components  were  older  and  were 
more  prone  to  failure  than  those  at  ITS,  the  SLAC  system  was  more 
reliable (MTBF 23.2 hours vs. 17.7 at ITS). This was attributed to the
fact  that  at  SLAC  a  great  deal  of  the  computational  load  was  served  by 
processors  that  act  as  batch  stream  servers.  Failures  within  these 
machines  do  not  affect  the  rest  of  the  system,  which  accounts  for  the 
relatively  high  fault  tolerance  of  the  SLAC  facility. 

2.4 MACHINE AND MANUAL RECORDED FAILURE DATA SUBSTANTIATION

So  far  in  these  studies,  failure  data  had  been  obtained  from 
manually recorded files. However, there is an automatic recording of
[...] these errors are corrected by retry or redundancy, their presence would
remain  unknown  except  for  the  record  in  LOGREC.  To  obtain  further 
verification  of  the  failure/load  relationship  by  looking  at  the  internal 
behavior of the machines, the next study used a portion of this
machine-collected data on failures (CPU errors) and correlated it with
workload data from the usual source (SMF) and from a software monitor
[Rossetti ...].


The  purpose  of  the  monitor,  which  was  written  especially  for  this 
project,  is  to  obtain  detailed  information  about  transient  behavior  in 
the  CPU.  The  reason  it  was  important  to  examine  the  CPU  error 
generation  process  is  that  a  large  portion  of  these  errors  were 
suspected  of  being  transient  or  intermittent  and  very  little  was  known 
about their behavior. The study showed that 95% of the CPU errors were
"soft"  errors,  that  is,  those  from  which  the  system  recovered,  and  that 
these  errors  also  exhibited  a  dependency  on  system  activity.  (Nearly 
90%  of  field  errors  are  believed  to  be  of  this  type  [Ball  69].)  By 
merging  the  error  data  with  the  load  data,  development  of  a  load-hazard 
model  for  CPU  errors  was  possible.  The  model  was  validated  by  seeding 
errors  in  an  artificially  created  data  base.  Details  of  the  experiment 
and  of  the  monitor  may  be  found  in  [Iyer  83a].  Measurement  and  modeling 
of hard CPU failures and system activity is described in [Iyer 84b].
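
The idea of merging error data with load data can be sketched as follows. This is a minimal illustration, not the load-hazard model of [Iyer 83a]: it simply divides the number of errors observed at each load level by the time the system spent at that level. The function name, the record layout, and the bin width are all assumptions made for the example.

```python
from collections import defaultdict

def load_hazard(samples, error_times, bin_width):
    """Empirical hazard (errors per hour) for each load bin.

    samples: list of (start_hour, duration_hours, load) workload intervals.
    error_times: list of error timestamps in hours.
    Returns {bin_lower_edge: errors per hour spent at that load level}.
    """
    exposure = defaultdict(float)  # hours spent in each load bin
    errors = defaultdict(int)      # errors observed in each load bin
    for start, duration, load in samples:
        b = int(load // bin_width) * bin_width
        exposure[b] += duration
        errors[b] += sum(1 for t in error_times if start <= t < start + duration)
    return {b: errors[b] / exposure[b] for b in sorted(exposure) if exposure[b] > 0}

# Hypothetical data: four one-hour intervals with a load measure in percent.
samples = [(0, 1, 10), (1, 1, 55), (2, 1, 60), (3, 1, 15)]
error_times = [1.2, 1.7, 2.5]
hazard = load_hazard(samples, error_times, bin_width=50)
print(hazard)
```

A hazard that rises from the low-load bin to the high-load bin is the kind of dependency on system activity described above; real data would need far more intervals per bin before any such conclusion could be drawn.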

2.5 CONFIRMATION BY DATA FROM A THIRD MAINFRAME


A  subsequent  analysis  performed  on  software  failure  data  from  a 
third type of mainframe (an IBM 3081) provided further confirmation of
the failure/load dependency [Rossetti 82]. Figure 2.3 shows a histogram
of  software  failures  by  hour  of  day  from  the  accumulated  data  analyzed. 
This  study  also  demonstrated  that  the  risk  of  software-related  failure 
increases  in  a  non-linear  fashion  with  the  percentage  of  interactive 
processing  (as  measured  by  parameters  such  as  the  paging  rate,  system 
overhead,  etc.).  This  was  the  first  indication  that,  not  only  the 
amount  of  system  activity,  but  also  the  type  of  system  activity 
influences  the  reliability  of  computers.  The  software  errors  most 
frequently  identified  with  system  failures  fell  into  three  major 
categories: error handling, logic, or hardware-related. It is
interesting that the code which provides fault tolerance, in the form of
exception handling built into the software, is in itself a frequent
cause of errors.


Figure  2.3.  Accumulated  software  failures  by  hour  of  day. 


3  ANALYSIS  OF  SOFTWARE  ERRORS  AND  FAULT  TOLERANCE 

The  motivation  for  this  study  came  from  the  analysis  by  [Beaudry 
77-79], which showed that software errors account for more system
failures  than  hardware  errors,  and  by  [Rossetti  82],  which  delineated 
the  software  errors  by  type  and  identified  those  most  frequently 
associated  with  system  failures.  The  aim  of  this  part  of  the  research 
was  to  assess  the  fault  tolerance  provided  by  the  computer  system  and  to 
identify  those  areas  where  recovery  procedures  were  ineffective. 

All  software  errors  detected  by  the  operating  system  were 
classified  according  to  error  type  and  then  statistically  analyzed  for 
error  frequency,  effectiveness  of  error  recovery  routines,  and  fault 
tolerance provided by the software. (Error classification reports that
were used in this study are [Endres 75], [Gannon 83], and [Thayer 76].)
The  results  showed  that  memory  allocation  or  addressing,  along  with 
deadlock  problems,  accounted  for  75  percent  of  all  software  errors. 
From  the  analysis  of  the  system  error  detection  and  correction 
capabilities,  estimates  were  made  of  the  fault  tolerance  to  errors  of 
different  types.  The  information  revealed  by  this  study  [Velardi  84] 
pinpoints  major  problem  areas,  i.e.  deadlock,  I/O  and  data  management, 
and exceptions, where improvement in the system recovery should be made.

In  a  second  study  [Iyer  84c],  hardware  and  software  interaction  was 
examined,  with  particular  emphasis  on  detection  and  correction  of 
software  errors  that  are  related  to  temporary  and  permanent  hardware 
problems.  It  was  found  that  recovery  from  hardware-related  software 
errors  is  less  likely  than  from  other  software  errors.  The  statistical 
information  about  error  patterns  will  be  invaluable  in  developing 
detection  and  recovery  schemes  for  hardware-related  software  errors, 
since the specific question of the interaction between hardware and
software is a subject that has not received careful study in the past.

4  FAILURE  MECHANISMS  AND  UNCERTAINTY  FACTORS 

To  better  understand  the  factors  that  cause  the  increase  in 
hardware  failures  when  system  activity  increases,  the  effects  of  the 
switching  rate  on  device  reliability  were  studied.  It  was  found  that 
the  higher  the  system  activity,  the  greater  is  the  risk  of  failure  due 
to thermal effects and electromigration at the device level [Cortes 84].

Programs for calculating the reliability of fault-tolerant systems
do  not  explicitly  take  into  account  the  effect  of  failures  in  the 
hardware  switching  mechanism.  Incorporation  of  switch  failures  in 
reliability  modeling  of  redundant  systems  was  studied  and  is  outlined  in 
[Amer 86]. Another study showed that, due to the effect of uncertainty
in failure rates, memory unreliability increases and may even double in
very  highly  reliable  systems  [Iyer  83b].  As  a  result  of  these  findings, 
the  possibility  of  developing  suitable  techniques  which  incorporate  an 
uncertainty  factor  in  failure  rate  estimations  should  be  investigated. 
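
A short worked approximation shows how uncertainty in a failure rate can raise, and even double, the predicted unreliability of a highly reliable system. It is an illustration under stated assumptions and not necessarily the model used in [Iyer 83b]: a redundant pair that fails only when both units fail, a mission time t short enough that the product of failure rate and t is much less than one, and a failure rate treated as a random variable with mean mu and variance sigma squared.

```latex
% Illustrative assumptions: redundant pair, \lambda t \ll 1,
% failure rate \lambda uncertain with mean \mu and variance \sigma^2.
\begin{align*}
Q(\lambda) &= \left(1 - e^{-\lambda t}\right)^{2} \approx (\lambda t)^{2}, \\
\mathbb{E}[Q] &\approx t^{2}\,\mathbb{E}[\lambda^{2}]
  = t^{2}\left(\mu^{2} + \sigma^{2}\right)
  = (\mu t)^{2}\left(1 + \frac{\sigma^{2}}{\mu^{2}}\right).
\end{align*}
```

With no uncertainty (sigma equal to zero) this reduces to the point estimate; when the standard deviation of the failure rate equals its mean, the expected unreliability is twice the point estimate.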

5  FACTORS  AFFECTING  OPERATING  SYSTEM  RELIABILITY 

One  objective  of  this  study  was  to  follow  up  and  expand  the 
investigations  reported  in  [Velardi  84]  and  [Iyer  84c, 85].  This 
decision  was  made  because  the  IBM  3081  used  in  these  studies  was 
upgraded to a Model K (with installation of additional hardware) and was
running under a newer operating system, MVS/XA. In addition, the
study  was  expanded  to  include  data  from  another  identical  computer,  not 
previously  monitored,  which  had  an  entirely  different  type  of 
utilization. In this way, two major comparisons could be made. First,
the  two  IBM  3081  systems  running  under  MVS/XA  were  compared  to  show  how 
the  type  of  utilization  affects  their  software  and  hardware  error 
behavior.  Second,  new  data  from  the  system  previously  monitored  were 
compared  to  those  reported  for  the  same  system  when  it  was  running  under 
the MVS/SP operating system [Velardi 84]. Full results of this study
are reported in [Mourad 85a,b].

5.1 COMPARISON OF TWO TYPES OF UTILIZATION

The  results  of  this  analysis  definitely  showed  there  was  a 
dependency  of  system  errors  on  type  of  system  utilization.  One  system 
studied  is  used  mainly  to  run  an  interactive  program  to  update  library 
acquisitions.  The  errors  encountered  with  this  system  were  related  to 
disk  problems  and  storage  management.  The  second  system  has  a  varied 
utilization that includes word processing, statistical packages, and
administrative and research programs. Deadlock and addressing
exceptions  were  the  major  problems  reported.  The  errors  of  the  first 
system  reflect  the  uniform  use  of  one  major  interactive  program,  while 
the  errors  of  the  second  result  from  the  more  varied  usage. 

5.2  COMPARISON  BETWEEN  TWO  OPERATING  SYSTEMS 

By  comparing  the  performance  of  the  same  computer  under  MVS/SP  and 
under  MVS/XA,  the  following  conclusions  were  made: 

* Deadlock and I/O management were still the main problem areas for
the  recovery  system. 

* Storage exception errors are more frequent under the newer
MVS/XA operating system, which is probably due to the bimodal
addressing implemented on MVS/XA.

*  Software  errors  are  still  more  frequently  detected  by  software 
than  by  the  hardware. 

* Lost records are definitely fewer, which indicates a more
reliable  system  for  recording  errors. 

* The MVS/XA system is more fault tolerant than MVS/SP.

6 FAILURE PREDICTION

Failure  prediction  provides  a  new  and  innovative  direction  toward 
improving  system  reliability  and  availability.  It  is  a  crucial  factor 
in  providing  failure  prevention  and  system  fault  tolerance. 
(Preliminary  work  in  this  area  has  also  been  started  at  Carnegie  Mellon 
University  [Siewiorek  84].)  The  problem  is  that  a  reliable  basis  must 
be  found  on  which  to  make  a  prediction.  During  investigation  of  the 
effect  of  system  activity  on  computer  systems  reliability,  however,  it 
became  apparent  there  was  an  increase  in  the  number  of  errors 
immediately  before  a  crash.  As  a  result  of  this  observation,  research 
on  the  possibility  of  predicting  failures  (based  on  a  prior  increase  of 
errors)  was  started. 

A  statistical  analysis  was  done  on  all  categories  of  hardware  and 
software  errors  automatically  recorded  by  the  IBM  3081  system  at  ITS  for 
a  period  of  six  months.  The  results  show  that  the  rate  of  generation  of 
certain  errors,  namely  those  from  failing  disk  drives  and  pending 
interrupts, appears to increase monotonically right before the
occurrence of a crash and, therefore, might be used to predict a crash
in advance. However, other error types have an abrupt increase of
errors before a crash (and display no discernible pattern) but may
indicate  certain  threshold  characteristics  that  would  be  useful  in 
failure  prediction.  Figure  6.1  shows  a  histogram  of  the  monotonic 
increase  in  errors  before  a  crash,  and  Figure  6.2  shows  an  abrupt 
increase.  The  interval  of  time  in  both  histograms  is  from  the  time  the 
system  was  last  started  to  the  time  of  the  crash. 

Figure 6.1. Monotonic increase in errors before a crash.

Figure 6.2. Abrupt increase in errors before a crash.
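
The two pre-crash patterns described above could be told apart with a simple heuristic such as the one sketched here. This is an illustration only: the function name, the rule for "abrupt", and the `jump_factor` threshold are assumptions, not values or methods taken from this research.

```python
def classify_precrash_pattern(counts, jump_factor=3.0):
    """Classify error counts per equal-width time bin, ordered from
    system restart (first bin) to crash (last bin).

    Returns "abrupt", "monotonic", or "none".
    jump_factor is a hypothetical threshold, not a value from the report.
    """
    if len(counts) < 3:
        return "none"
    earlier, last = counts[:-1], counts[-1]
    # Abrupt: the final bin dwarfs every earlier bin.
    if last > 0 and last >= jump_factor * max(earlier):
        return "abrupt"
    # Monotonic: counts never decrease and end higher than they began.
    rising = all(a <= b for a, b in zip(counts, counts[1:]))
    if rising and last > counts[0]:
        return "monotonic"
    return "none"

print(classify_precrash_pattern([1, 2, 3, 4, 5]))
print(classify_precrash_pattern([1, 1, 1, 1, 10]))
```

A monotonic pattern gives advance warning as the counts climb, whereas an abrupt pattern can only be caught by a threshold on the most recent interval, which is why the two cases call for different prediction rules.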


Details  of  the  methodology  for  analysis  of  failure  prediction  data 
developed during this research have been reported in [Nassar 85a,b].
The first step of the methodology was to characterize and cluster
crashes  to  find  appropriate  intervals  of  time  to  analyze  between  system 
restart  and  a  crash.  The  next  steps  involved  averaging  and  weighting 
error  distribution  data,  analyzing  individual  intervals  between  crashes, 
and analyzing CPU utilization rates prior to crashes. The results of
the  analysis  demonstrated  that  there  are  certain  recurrent  patterns  in 
the  distribution  of  errors  before  a  crash.  In  addition,  analysis  of 
system  utilization  rates  prior  to  a  crash  indicated  existence  of  a 
relationship  between  high  utilization  and  frequency  of  system  failures. 
These  preliminary  results  indicated  that  failure  prediction  based  on  a 
combination of one or more of these factors may be feasible.
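
The first step of the methodology, clustering crashes, might look like the sketch below. The actual clustering rule is given in [Nassar 85a,b] and is not reproduced here; this example assumes a simple gap rule, in which a crash joins the current cluster if it follows the previous crash by no more than `max_gap` hours. The function name and the threshold are hypothetical.

```python
def cluster_crashes(crash_times, max_gap):
    """Group crash timestamps (in hours) into clusters. A crash joins the
    current cluster if it follows the previous crash by at most max_gap."""
    clusters = []
    for t in sorted(crash_times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

# Hypothetical crash log, in hours since the start of observation.
clusters = cluster_crashes([10, 10.5, 11, 40, 41, 90], max_gap=2)
print(clusters)
```

Treating a burst of closely spaced crashes as one event leaves longer restart-to-crash intervals in which an error distribution can be measured.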